[Feature]: implement the fusion of allreduce and matmul in prefill phase when tp is enabled #1926
Conversation
Do you have any relevant performance data?
Hello, I have used this PR, based on the version as follows: from user code: Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
torch in my env is torch==2.7.1 and torch_npu==2.7.1rc1
I validated this feature on A2.
Thanks very much. I want to use this feature on A3, and my goal is to decrease decode latency. I saw your feature works in the prefill phase; does it also work in the decode phase?
Codecov Report: ❌ Patch coverage is
Additional details and impacted files:
@@            Coverage Diff             @@
##             main    #1926      +/-   ##
==========================================
+ Coverage   71.73%   72.00%    +0.26%
==========================================
  Files          96       98        +2
  Lines       10719    10843      +124
==========================================
+ Hits         7689     7807      +118
- Misses       3030     3036        +6
This pull request has conflicts, please resolve those before we can evaluate the pull request.
So this feature relies on pta 2.7.1, right? If so, I think we should merge this after pta is upgraded on the main branch.
This feature doesn't rely on torch 2.7.1; npu_mm_all_reduce_base has been supported since torch_npu 2.1.0. I mentioned torch 2.7.1 only to tell the questioner which torch version is in my env. Another colleague successfully ran this feature with torch 2.5.1.
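Since the discussion above turns on which torch_npu releases ship the fused op, a small runtime check can confirm whether the installed build exposes it. This is an illustrative sketch, not code from this PR:

```python
import torch_npu

# True when the installed torch_npu build exposes the fused kernel;
# older builds would need the separate matmul + all-reduce path instead.
has_fused_op = hasattr(torch_npu, "npu_mm_all_reduce_base")
print(f"npu_mm_all_reduce_base available: {has_fused_op}")
```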
This PR improves performance with small batches but degrades it with large batches, specifically during decoding.
(Benchmark screenshots: before this PR vs. after this PR)
What this PR does / why we need it?
vLLM's RowParallelLinear forward function executes the allreduce and the matmul separately. This PR patches that function to use torch_npu.npu_mm_all_reduce_base, which runs the matmul and the allreduce as one fused kernel, giving about a 20% performance improvement in eager mode (see the sketch below).
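For readers unfamiliar with the change, here is a minimal sketch of the idea, not the actual patch from this PR. The weight layout, the way the HCCL group name is obtained, and the keyword arguments of npu_mm_all_reduce_base are assumptions and may differ across torch_npu versions:

```python
import torch
import torch_npu  # Ascend extension that provides the fused kernel


def row_parallel_matmul_allreduce(x, weight, hcom_name, tp_size):
    """Illustrative forward step of a row-parallel linear layer on Ascend.

    `weight` is assumed to be laid out as (input_size_per_partition,
    output_size) so it can be fed to the fused op directly; `hcom_name`
    is assumed to be the HCCL communication-group name of the TP group.
    """
    if tp_size > 1:
        # Fused path: one kernel performs the matmul and the all-reduce
        # over the tensor-parallel group, replacing the separate
        # matmul followed by all_reduce that the stock forward does.
        return torch_npu.npu_mm_all_reduce_base(x, weight, hcom_name,
                                                reduce_op="sum")
    # Single-rank fallback: a plain matmul, no communication needed.
    return torch.matmul(x, weight)
```

The fused call avoids launching the matmul and the collective as two separate ops, which is presumably where the reported ~20% eager-mode gain comes from.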
Does this PR introduce any user-facing change?
This PR introduces a new env variable, VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE, to control whether the feature is enabled (see the usage sketch at the end).

How was this patch tested?
The patch is tested by adding a new test file, test_patch_linear.py, to guard the feature with unit tests.
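As a usage note, here is a hedged sketch of enabling the feature from the environment; the accepted value ("1") is an assumption, so check the vllm-ascend env-variable documentation for the authoritative setting:

```python
import os

# Opt in to the fused matmul + all-reduce path introduced by this PR.
# Setting the variable to "1" is an assumption about the accepted value.
os.environ["VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE"] = "1"
```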